Flexible Data Analysis Pipeline for High-Confidence Proteogenomics

نویسندگان

Hendrik Weisser

James C. Wright

Jonathan M. Mudge

Petra Gutenbrunner

Jyoti S. Choudhary

چکیده

Proteogenomics leverages information derived from proteomic data to improve genome annotations. Of particular interest are "novel" peptides that provide direct evidence of protein expression for genomic regions not previously annotated as protein-coding. We present a modular, automated data analysis pipeline aimed at detecting such "novel" peptides in proteomic data sets. This pipeline implements criteria developed by proteomics and genome annotation experts for high-stringency peptide identification and filtering. Our pipeline is based on the OpenMS computational framework; it incorporates multiple database search engines for peptide identification and applies a machine-learning approach (Percolator) to post-process search results. We describe several new and improved software tools that we developed to facilitate proteogenomic analyses that enhance the wealth of tools provided by OpenMS. We demonstrate the application of our pipeline to a human testis tissue data set previously acquired for the Chromosome-Centric Human Proteome Project, which led to the addition of five new gene annotations on the human reference genome.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

PGP: parallel prokaryotic proteogenomics pipeline for MPI clusters, high-throughput batch clusters and multicore workstations

SUMMARY We present the first public release of our proteogenomic annotation pipeline. We have previously used our original unreleased implementation to improve the annotation of 46 diverse prokaryotic genomes by discovering novel genes, post-translational modifications and correcting the erroneous annotations by analyzing proteomic mass-spectrometry data. This public version has been redesigned...

متن کامل

The influence of transcript assembly on the proteogenomics discovery of microproteins

Proteogenomics methods have identified many non-annotated protein-coding genes in the human genome. Many of the newly discovered protein-coding genes encode peptides and small proteins, referred to collectively as microproteins. Microproteins are produced through ribosome translation of small open reading frames (smORFs). The discovery of many smORFs reveals a blind spot in traditional gene-fin...

متن کامل

N-terminal Proteomics Assisted Profiling of the Unexplored Translation Initiation Landscape in Arabidopsis thaliana *

Proteogenomics is an emerging research field yet lacking a uniform method of analysis. Proteogenomic studies in which N-terminal proteomics and ribosome profiling are combined, suggest that a high number of protein start sites are currently missing in genome annotations. We constructed a proteogenomic pipeline specific for the analysis of N-terminal proteomics data, with the aim of discovering ...

متن کامل

Flexible and Accessible Workflows for Improved Proteogenomic Analysis Using the Galaxy Framework

Proteogenomics combines large-scale genomic and transcriptomic data with mass-spectrometry-based proteomic data to discover novel protein sequence variants and improve genome annotation. In contrast with conventional proteomic applications, proteogenomic analysis requires a number of additional data processing steps. Ideally, these required steps would be integrated and automated via a single s...

متن کامل

Methods, Tools and Current Perspectives in Proteogenomics.

With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More rece...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره 15 شماره

صفحات -

تاریخ انتشار 2016

Flexible Data Analysis Pipeline for High-Confidence Proteogenomics

نویسندگان

چکیده

منابع مشابه

PGP: parallel prokaryotic proteogenomics pipeline for MPI clusters, high-throughput batch clusters and multicore workstations

The influence of transcript assembly on the proteogenomics discovery of microproteins

N-terminal Proteomics Assisted Profiling of the Unexplored Translation Initiation Landscape in Arabidopsis thaliana *

Flexible and Accessible Workflows for Improved Proteogenomic Analysis Using the Galaxy Framework

Methods, Tools and Current Perspectives in Proteogenomics.

عنوان ژورنال:

اشتراک گذاری